The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
!pip install scikit-learn-extra
Collecting scikit-learn-extra
Downloading scikit_learn_extra-0.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 9.7 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.25.2)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.11.4)
Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn-extra) (1.2.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (1.4.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (3.4.0)
Installing collected packages: scikit-learn-extra
Successfully installed scikit-learn-extra-0.3.0
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# Importing PCA and t-SNE
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Importing clustering algorithms
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
# Silhouette score
from sklearn.metrics import silhouette_score
from statistics import mode
import warnings
warnings.filterwarnings("ignore")
data=pd.read_csv("https://github.com/jesusina/clustering_PCA/blob/main/marketing_campaign.csv?raw=true")
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Response             2240 non-null   int64
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
We have 24 missing values for Income. We'll fill them with normally distributed draws centered on the dataset's median income (rather than the mean, to protect against outliers), with a deliberately narrow spread.
meanIncome, stdIncome=data['Income'].median(), data['Income'].std()/4
indices_to_be_updated=data[data['Income'].isnull()].index
data.loc[indices_to_be_updated, 'Income']=np.random.normal(meanIncome, stdIncome, len(indices_to_be_updated))
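The same imputation can be made reproducible with a fixed seed and a floor at zero, so no negative incomes are drawn. A minimal sketch on a toy frame (the column name `Income` matches the dataset; the values below are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducible draws
toy = pd.DataFrame({"Income": [40000.0, 55000.0, np.nan, 61000.0, np.nan]})

center = toy["Income"].median()   # median, robust to outliers
spread = toy["Income"].std() / 4  # deliberately narrow spread, as above
missing = toy["Income"].isnull()
draws = rng.normal(center, spread, missing.sum())
toy.loc[missing, "Income"] = np.clip(draws, 0, None)  # floor at zero
```

Seeding up front means the filled values do not change between notebook runs, which matters once clustering results are compared.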
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2240 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Response             2240 non-null   int64
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
Let's check for duplicates:
len(data['ID'].unique())
2240
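Comparing the unique count to the row count works; `Series.duplicated` states the same intent directly. A tiny sketch (the IDs below are illustrative, taken from the head shown earlier):

```python
import pandas as pd

ids = pd.Series([5524, 2174, 4141, 6182, 5324])  # sample customer IDs
assert not ids.duplicated().any()  # no repeated customer IDs
# equivalent check: ids.nunique() == len(ids)
```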
data['Marital_Status'].value_counts()
Marital_Status
Married     864
Together    580
Single      480
Divorced    232
Widow        77
Alone         3
Absurd        2
YOLO          2
Name: count, dtype: int64
data['Education'].value_counts()
Education
Graduation    1127
PhD            486
Master         370
2n Cycle       203
Basic           54
Name: count, dtype: int64
Attempting to interpret these categories.
data.groupby('Education')['Income'].mean()
Education
2n Cycle      47722.176854
Basic         20306.259259
Graduation    52720.161614
Master        52895.744529
PhD           56123.641267
Name: Income, dtype: float64
sns.boxplot(data, x='Education', y='Income')
<Axes: xlabel='Education', ylabel='Income'>
We have a single extreme outlier in Income, roughly four times any other value. Who is this person?
data[data['Income']>200000]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2233 | 9432 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | ... | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 rows × 27 columns
This person was likely not reporting their income seriously. Since the rest of their record looks usable, let's keep the row but set their income to the median for their "Graduation" education level.
data.at[2233,'Income']=data.groupby('Education')['Income'].median()['Graduation']
sns.boxplot(data, x='Education', y='Income', hue='Marital_Status')
<Axes: xlabel='Education', ylabel='Income'>
data['Dt_Customer'].value_counts()
Dt_Customer
31-08-2012 12
12-09-2012 11
14-02-2013 11
12-05-2014 11
20-08-2013 10
..
05-08-2012 1
18-11-2012 1
25-05-2013 1
14-04-2013 1
09-01-2014 1
Name: count, Length: 663, dtype: int64
From the context of these columns, we are dealing with a grocery store with an online presence, trying to determine how to spend its marketing budget to maximize profit.
There are 2240 unique customers and, after imputation, no missing values in the dataset. Not counting customer ID, we have 26 columns, of which 23 are numerical. Seven of these numerical columns are binary (0/1) dummy variables. The remaining 3 columns are non-numerical: a date string (Dt_Customer), education, and marital status. These need to be treated carefully.
Dt_Customer: There are 663 unique dates on which our 2240 customers were acquired.
Education: Based on the univariate analysis of income, "Basic" here seems to indicate less than a bachelor's degree. "2n Cycle" is the European equivalent of a master's degree, though it may be worthwhile to keep it as a distinct category.
Marital Status: the responses here indicate customers were permitted to fill out an "Other" category with a custom string response. The author of this notebook would advise the data engineers to restrict the choices to married, together, single, divorced, and widow, since "Alone" should be taken as single and the remaining answers of "Absurd" and "YOLO" are obviously not useful. We'll deal with the treatment of these below.
Converting "Alone" to "Single":
data['Marital_Status']=data['Marital_Status'].apply(lambda x: 'Single' if x=='Alone' else x)
data['Marital_Status'].value_counts()
Marital_Status
Married     864
Together    580
Single      483
Divorced    232
Widow        77
Absurd        2
YOLO          2
Name: count, dtype: int64
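The per-value lambdas used in this notebook can be collapsed into a single `replace` call with an explicit mapping. A sketch on a toy series; the mapping mirrors the treatment applied in this notebook ("Absurd" and "YOLO" are mapped to "Married" further below):

```python
import pandas as pd

status = pd.Series(["Married", "Alone", "YOLO", "Absurd", "Single"])
mapping = {"Alone": "Single", "YOLO": "Married", "Absurd": "Married"}
cleaned = status.replace(mapping)  # one pass over all the odd categories
```

This keeps every category decision in one dictionary, which is easier to review than scattered lambdas.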
Let's clean up the date column
data['Dt_Customer']=pd.to_datetime(data['Dt_Customer'],dayfirst=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2240 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   datetime64[ns]
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Response             2240 non-null   int64
dtypes: datetime64[ns](1), float64(1), int64(23), object(2)
memory usage: 472.6+ KB
data['Year'], data['Month'], data['Day']=data['Dt_Customer'].dt.year, data['Dt_Customer'].dt.month, data['Dt_Customer'].dt.day
data=data.drop(['Dt_Customer'],axis=1)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2240 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Recency              2240 non-null   int64
 8   MntWines             2240 non-null   int64
 9   MntFruits            2240 non-null   int64
 10  MntMeatProducts      2240 non-null   int64
 11  MntFishProducts      2240 non-null   int64
 12  MntSweetProducts     2240 non-null   int64
 13  MntGoldProds         2240 non-null   int64
 14  NumDealsPurchases    2240 non-null   int64
 15  NumWebPurchases      2240 non-null   int64
 16  NumCatalogPurchases  2240 non-null   int64
 17  NumStorePurchases    2240 non-null   int64
 18  NumWebVisitsMonth    2240 non-null   int64
 19  AcceptedCmp3         2240 non-null   int64
 20  AcceptedCmp4         2240 non-null   int64
 21  AcceptedCmp5         2240 non-null   int64
 22  AcceptedCmp1         2240 non-null   int64
 23  AcceptedCmp2         2240 non-null   int64
 24  Complain             2240 non-null   int64
 25  Response             2240 non-null   int64
 26  Year                 2240 non-null   int32
 27  Month                2240 non-null   int32
 28  Day                  2240 non-null   int32
dtypes: float64(1), int32(3), int64(23), object(2)
memory usage: 481.4+ KB
data_num=data.select_dtypes(include=['int', 'float'])
fig = plt.figure(figsize = (18, 6))
sns.heatmap(data_num.corr(), annot = True);
plt.xticks(rotation = 45);
Month and day of joining show minimal correlation with our other variables, except for an interesting connection between month and year.
import plotly.express as px
px.box(data, x='Year', y='Month', points="all")
The negative correlation is purely an artifact of the time window being considered (July 2012-June 2014). Let's drop month and day.
data=data.drop(['Month', 'Day'],axis=1)
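The window artifact can be confirmed directly: 2012 contributes only the later months and 2014 only the earlier ones, which mechanically induces a negative Year-Month correlation. A self-contained sketch with synthetic join dates spanning the same window:

```python
import pandas as pd

# synthetic monthly join dates covering July 2012 through June 2014
dates = pd.date_range("2012-07-01", "2014-06-01", freq="MS")
df = pd.DataFrame({"Year": dates.year, "Month": dates.month})
print(pd.crosstab(df["Year"], df["Month"]))  # 2012 fills only months 7-12, 2014 only 1-6
print(df["Year"].corr(df["Month"]))          # negative purely from the truncated window
```

The per-year month averages (9.5, 6.5, 3.5) fall as the year rises, even though nothing about customer behavior connects the two.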
Let's look at our YOLO and Absurd Marital_Status points. We note that Teenhome is the strongest correlate to birth year.
sns.catplot(data, y='Year_Birth', x='Marital_Status', hue='Kidhome', col='Teenhome')
<seaborn.axisgrid.FacetGrid at 0x788b6f8b4e20>
Exploratory data analysis doesn't give much in the way of conclusions for the "YOLO" and "Absurd" marital statuses, so we'll assign them to Married, the most common marital status. We also note outliers in Year_Birth, which we'll set to the mean Year_Birth.
data['Marital_Status']=data['Marital_Status'].apply(lambda x: 'Married' if x=='YOLO' or x=="Absurd" else x)
data['Year_Birth']=data['Year_Birth'].apply(lambda x: data['Year_Birth'].mean() if x<1920 else x)
sns.catplot(data, y='Year_Birth', x='Marital_Status', hue='Kidhome', col='Education')
<seaborn.axisgrid.FacetGrid at 0x788b6f8b5540>
Dropping customer ID, which has been confirmed to be uncorrelated to any of our other variables.
data=data.drop(['ID'], axis=1)
data_dums=pd.get_dummies(data, drop_first=True)
data_dums.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2240.0 | 1968.901526 | 11.694076 | 1940.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2240.0 | 51978.877309 | 21420.317931 | 1730.0 | 35538.75 | 51411.5 | 68289.75 | 162397.0 |
| Kidhome | 2240.0 | 0.444196 | 0.538398 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2240.0 | 0.506250 | 0.544538 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2240.0 | 49.109375 | 28.962453 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2240.0 | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.302232 | 39.773434 | 0.0 | 1.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2240.0 | 166.950000 | 225.715373 | 0.0 | 16.00 | 67.0 | 232.00 | 1725.0 |
| MntFishProducts | 2240.0 | 37.525446 | 54.628979 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2240.0 | 27.062946 | 41.280498 | 0.0 | 1.00 | 8.0 | 33.00 | 263.0 |
| MntGoldProds | 2240.0 | 44.021875 | 52.167439 | 0.0 | 9.00 | 24.0 | 56.00 | 362.0 |
| NumDealsPurchases | 2240.0 | 2.325000 | 1.932238 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2240.0 | 4.084821 | 2.778714 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2240.0 | 2.662054 | 2.923101 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2240.0 | 5.790179 | 3.250958 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2240.0 | 5.316518 | 2.426645 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2240.0 | 0.074554 | 0.262728 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2240.0 | 0.064286 | 0.245316 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2240.0 | 0.012946 | 0.113069 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2240.0 | 0.009375 | 0.096391 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Response | 2240.0 | 0.149107 | 0.356274 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Year | 2240.0 | 2013.028125 | 0.684554 | 2012.0 | 2013.00 | 2013.0 | 2013.00 | 2014.0 |
1. What are the summary statistics of the data? Explore summary statistics for the numerical and categorical variables.
We note a large number of dummy variable columns, whose values equal 0 or 1; for these, the mean is simply the proportion of customers having that attribute (e.g. 0.9% of customers had complained).
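A binary 0/1 column's mean is exactly the share of ones, which is why `describe()` doubles as a rate table for the dummy columns. A tiny demonstration (the flags below are illustrative, not dataset rows):

```python
import pandas as pd

complain = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])  # illustrative 0/1 flags
assert complain.mean() == complain.sum() / len(complain)  # mean == share of ones
print(complain.mean())  # 0.1, i.e. 10% complained
```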
2. Find the number of unique observations in each category of the categorical columns. Write your findings/observations/insights.
data['Marital_Status'].value_counts(), data['Marital_Status'].value_counts()/len(data)
(Marital_Status
 Married     868
 Together    580
 Single      483
 Divorced    232
 Widow        77
 Name: count, dtype: int64,
 Marital_Status
 Married     0.387500
 Together    0.258929
 Single      0.215625
 Divorced    0.103571
 Widow       0.034375
 Name: count, dtype: float64)
data['Education'].value_counts(), data['Education'].value_counts()/len(data)
(Education
 Graduation    1127
 PhD            486
 Master         370
 2n Cycle       203
 Basic           54
 Name: count, dtype: int64,
 Education
 Graduation    0.503125
 PhD           0.216964
 Master        0.165179
 2n Cycle      0.090625
 Basic         0.024107
 Name: count, dtype: float64)
The largest segment of our customer base (868 of them, 39%) are married, followed by "Together" (580, 26%) and Single (483, 22%).
The largest education segment (1127 customers, 50%) is "Graduation", i.e. college graduates, while 486 of them (22%) hold PhDs.
3. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
Though 2n Cycle is essentially the European equivalent of a master's, we leave the categories distinct in case they yield some predictive power, if nothing else as a stand-in for country.
4. There are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?
Yes, we have treated this condition. "Alone" was changed to "Single" while "Absurd" and "YOLO" were converted to "Married" as the modal Marital_Status.
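More generally, any category below a count threshold can be folded into a fallback label rather than handled value by value. A sketch; the threshold of 5 matches the question, and the fallback here is the modal category, as done above:

```python
import pandas as pd

status = pd.Series(["Married"] * 8 + ["Single"] * 6 + ["YOLO", "Absurd"])
counts = status.value_counts()
rare = counts[counts < 5].index   # categories seen fewer than 5 times
fallback = counts.idxmax()        # modal category ("Married" here)
status = status.where(~status.isin(rare), fallback)
```

This scales to unseen junk values, whereas hand-written lambdas only cover the categories already observed.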
Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.
Leading Questions:
data[data['Income']>120000]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | ... | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 1973.0 | PhD | Married | 157243.0 | 0 | 1 | 98 | 20 | 2 | 1582 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 |
| 617 | 1976.0 | PhD | Together | 162397.0 | 1 | 1 | 31 | 85 | 1 | 16 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 |
| 655 | 1975.0 | Graduation | Divorced | 153924.0 | 0 | 0 | 81 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 |
| 687 | 1982.0 | PhD | Married | 160803.0 | 0 | 0 | 21 | 55 | 16 | 1622 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2012 |
| 1300 | 1971.0 | Master | Together | 157733.0 | 1 | 0 | 37 | 39 | 1 | 9 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 |
| 1653 | 1977.0 | Graduation | Together | 157146.0 | 0 | 0 | 13 | 1 | 0 | 1725 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 |
| 2132 | 1949.0 | PhD | Married | 156924.0 | 0 | 0 | 85 | 2 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013 |
7 rows × 26 columns
These 7 customers represent 0.3% of the data, so the boxplot's right whisker sits at roughly the 99.7th percentile. We elect to replace these incomes with values normally distributed around the median.
indices_to_be_updated=data[data['Income']>120000].index
#n.b. meanIncome is actually median, as per above treatment to protect against outlier effects
np.random.seed(42)
data.loc[indices_to_be_updated, 'Income']=np.random.normal( meanIncome, stdIncome, len(indices_to_be_updated))
'''i=0
while len(data[data['Income']<0])>0:
np.random.seed(i)
data.loc[indices_to_be_updated, 'Income']=np.random.normal( meanIncome, stdIncome, len(indices_to_be_updated))
i+=1'''
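An alternative to redrawing these incomes is to winsorize, i.e. clip values at a chosen quantile so the ranking is preserved without the extreme leverage. A sketch on synthetic incomes (the 99.7th percentile echoes the whisker discussion above; this is not the step the notebook takes):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
incomes = pd.Series(rng.normal(52000, 20000, 1000)).clip(lower=0)
incomes.iloc[0] = 666666.0            # one implausible self-report
cap = incomes.quantile(0.997)         # upper winsorizing point
clipped = incomes.clip(upper=cap)     # extreme values pulled in, order preserved
```

Winsorizing is deterministic and keeps the affected rows' relative rank, at the cost of piling them up at the cap.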
numerical_cols = [col for col in data.columns if pd.api.types.is_numeric_dtype(data[col])]
for col in numerical_cols:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(bins = 10, grid = False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x = data[col])
    plt.show()
Year_Birth Skew : -0.09
Income Skew : 0.01
Kidhome Skew : 0.64
Teenhome Skew : 0.41
Recency Skew : -0.0
MntWines Skew : 1.18
MntFruits Skew : 2.1
MntMeatProducts Skew : 2.08
MntFishProducts Skew : 1.92
MntSweetProducts Skew : 2.14
MntGoldProds Skew : 1.89
NumDealsPurchases Skew : 2.42
NumWebPurchases Skew : 1.38
NumCatalogPurchases Skew : 1.88
NumStorePurchases Skew : 0.7
NumWebVisitsMonth Skew : 0.21
AcceptedCmp3 Skew : 3.29
AcceptedCmp4 Skew : 3.24
AcceptedCmp5 Skew : 3.29
AcceptedCmp1 Skew : 3.56
AcceptedCmp2 Skew : 8.62
Complain Skew : 10.19
Response Skew : 1.97
Year Skew : -0.04
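The spend and purchase-count columns are strongly right-skewed (skew around 2 or more). If that skew becomes a concern for distance-based clustering later, a log1p transform pulls it in substantially. A sketch on synthetic spend data (not a step this notebook performs):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# lognormal draws mimic the heavy right tail of the Mnt columns
spend = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=2000))
print(round(spend.skew(), 2))            # strongly positive
print(round(np.log1p(spend).skew(), 2))  # much closer to zero
```

`log1p` (log of 1 + x) is preferred over a plain log here because the spend columns contain zeros.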
A few columns warrant a closer individual look:
col="Year_Birth"
print(col)
print('Skew :', round(data[col].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
data[col].hist(bins = 10, grid = False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = data[col])
plt.show()
Year_Birth Skew : -0.09
col="Income"
print(col)
print('Skew :', round(data[col].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
data[col].hist(bins = 10, grid = False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = data[col])
plt.show()
Income Skew : 0.01
col="MntWines"
print(col)
print('Skew :', round(data[col].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
data[col].hist(bins = 10, grid = False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = data[col])
plt.show()
MntWines Skew : 1.18
fig = plt.figure(figsize = (28, 12))
sns.heatmap(data_dums.corr(), annot = True);
plt.xticks(rotation = 70);
sns.catplot(data, y='Year_Birth', x='Marital_Status', hue='Education')
<seaborn.axisgrid.FacetGrid at 0x788b6e0b8c70>
sns.boxplot(data, y='Year_Birth', x='Marital_Status', hue='Education')
<Axes: xlabel='Marital_Status', ylabel='Year_Birth'>
Key observations are that PhDs and Masters tend to be older, while Basic education tends to be younger. The 2n Cycle customers seem consistently a bit older than their Master counterparts, which means keeping them as separate categories is worthwhile.
sns.boxplot(data, y='Income', x='Education', hue='Marital_Status')
<Axes: xlabel='Education', ylabel='Income'>
The only notable education effect on income is that Basic education earns a lot less. We also note that widows earn more, but they are also older, so this is expected (corr(Income, Year_Birth) = -0.2).
Let's see who does the shopping:
plt.figure(figsize=(15, 7))
plt.subplot(1, 2, 1)
sns.boxplot(data, y='Income', x='NumWebPurchases')
plt.subplot(1, 2, 2)
sns.boxplot(data, y='Income', x='NumStorePurchases')
plt.show()
plt.figure(figsize=(15, 7))
plt.subplot(1, 2, 1)
sns.boxplot(data, x='NumWebVisitsMonth', y='Income')
plt.subplot(1, 2, 2)
sns.scatterplot(data, y='NumWebVisitsMonth', x='Income', hue="Education")
plt.tight_layout()
plt.show()
These two plots tell interesting and complementary stories. The first shows that customers with fewer web visits per month tend to have higher incomes, while the long right tail of customers making many visits earn far less. The scatter at right shows the same: only the very lowest-income customers visit the website more than 10 times per month. This seems to contradict the earlier finding that the highest-income individuals made the most online purchases.
sns.jointplot(data, x='NumWebVisitsMonth', y='NumWebPurchases' ,hue="Education");
Unfortunately, Pearson correlation is not robust to outliers, so these 9 points may account for much of the -0.52 correlation. Let's check.
NumWebPurchases_no_outliers=data[data['NumWebPurchases']<15]['NumWebPurchases']
NumWebVisitsMonth_no_outliers=data[data['NumWebVisitsMonth']<15]['NumWebVisitsMonth']
print(NumWebPurchases_no_outliers.corr(NumWebVisitsMonth_no_outliers))
sns.scatterplot(x=NumWebVisitsMonth_no_outliers, y=NumWebPurchases_no_outliers)
-0.010576900440938886
<Axes: xlabel='NumWebVisitsMonth', ylabel='NumWebPurchases'>
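A complementary check is a rank-based (Spearman) correlation, which is far less sensitive to a handful of extreme points than Pearson; pandas exposes it via `method='spearman'`. A deterministic sketch with nine roughly monotone pairs plus one leverage point (synthetic data, not the dataset):

```python
import pandas as pd

# nine roughly increasing pairs plus one extreme leverage point (100, 0)
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
y = pd.Series([2, 1, 4, 3, 6, 5, 8, 7, 10, 0], dtype=float)
print(round(x.corr(y), 2))                     # Pearson: -0.43, driven by one point
print(round(x.corr(y, method="spearman"), 2))  # Spearman: 0.41, rank-based
```

One point flips the sign of Pearson while Spearman still reflects the bulk of the data, which is exactly the failure mode being probed above.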
The heatmap above is a bit unwieldy, but it may prove useful to look at two subsets of it:
cols_with_Mnt = [col for col in data.columns if "Mnt" in col]
cols_with_Mnt_Edu = [col for col in data.columns if "Mnt" in col or 'Education' in col]
cols_with_accepted = [col for col in data.columns if "Accepted" in col or "Response" in col]
print(cols_with_Mnt)
print(cols_with_accepted)
['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'] ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response']
fig = plt.figure(figsize = (18, 6))
sns.heatmap(data[cols_with_Mnt].corr(), annot = True);
plt.xticks(rotation = 45);
sns.pairplot(data[cols_with_Mnt_Edu], corner=True, hue='Education')
<seaborn.axisgrid.PairGrid at 0x788b720c1030>
fig = plt.figure(figsize = (18, 6))
sns.heatmap(data[cols_with_accepted].corr(), annot = True);
plt.xticks(rotation = 45);
There is not much structure here beyond the final Response being modestly correlated with accepting any prior campaign; campaign 3 stands out for being uncorrelated with the other campaigns, suggesting its strategy was distinct.
means=data[cols_with_accepted].mean()
plt.xticks(rotation = 45);
plt.ylabel("Success rate")
sns.barplot(means)
<Axes: ylabel='Success rate'>
Campaign 2 was the least successful. The last campaign was the most successful. Let's break this down by demographic.
for col in cols_with_accepted:
    print(col)
    demo_data=data.pivot_table(index='Marital_Status', columns='Education', values=col)
    plt.figure(figsize = (5, 5)) # To resize the plot
    sns.heatmap(demo_data, cmap = 'viridis', annot = True)
    plt.show()
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Response
demo_data=data.pivot_table(index='Marital_Status', columns='Education', values='NumDealsPurchases')
plt.figure(figsize = (5, 5)) # To resize the plot
sns.heatmap(demo_data, cmap = 'viridis', annot = True)
plt.show()
In this section, we will first prepare our dataset for analysis.
Think About It:
new_cols=[]
#current customer age
data['Age']=2016-data['Year_Birth']
new_cols.append('Age')#keeping a running list of my new cols for EDA
data.drop('Year_Birth', axis=1, inplace=True)
#total kids
data['TotDependents']=data['Kidhome']+data['Teenhome']
new_cols.append('TotDependents')
#family size
data['Adults']=data.apply(lambda x: 1 if x['Marital_Status'] in ['Single', 'Widow', 'Divorced'] else 2, axis=1)
data['FamilySize']=data['TotDependents']+data['Adults']
new_cols.append('FamilySize')
#total spent
data['TotSpent']=data[cols_with_Mnt].sum(axis=1)
new_cols.append('TotSpent')
#customer longevity
data['Longevity']=2016-data['Year']
new_cols.append('Longevity')
#total offers accepted
#data[cols_with_accepted].corr()>0.5
print(cols_with_accepted)
['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response']
data['TotAccepted']=data[cols_with_accepted].sum(axis=1)
new_cols.append('TotAccepted')
data['TotAccepted'].value_counts(normalize=True)
TotAccepted
0    0.728125
1    0.165179
2    0.063393
3    0.022768
4    0.016518
5    0.004018
Name: proportion, dtype: float64
len(data[data['NumDealsPurchases']==1])
970
#Total number purchases
data['TotNumPurchased']=data['NumCatalogPurchases']+data['NumStorePurchases']+data['NumWebPurchases']
new_cols.append('TotNumPurchased')
#Spent per purchase, Divide by 0 exception set to 0
data['AvgPerPurchase']=data.apply(lambda x: 0 if x['TotNumPurchased']==0 else x['TotSpent'] / x['TotNumPurchased'], axis=1)
new_cols.append('AvgPerPurchase')
#data[cols_with_accepted].corr()>0.5
data['AvgPerPurchase'].min()
0.0
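The row-wise `apply` with a divide-by-zero guard can also be written vectorized, which is considerably faster on larger frames. A sketch with toy totals (the two series stand in for TotSpent and TotNumPurchased):

```python
import numpy as np
import pandas as pd

spent = pd.Series([120.0, 0.0, 300.0])
n_purchases = pd.Series([4, 0, 10])
# replace zero denominators with NaN, divide, then map NaN back to 0
avg = (spent / n_purchases.replace(0, np.nan)).fillna(0.0)
print(avg.tolist())  # [30.0, 0.0, 30.0]
```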
Statistics for engineered features.
for col in new_cols:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(bins = 10, grid = False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x = data[col])
    plt.show()
Age Skew : 0.09
TotDependents Skew : 0.42
FamilySize Skew : 0.18
TotSpent Skew : 0.86
Longevity Skew : 0.04
TotAccepted Skew : 2.43
TotNumPurchased Skew : 0.3
AvgPerPurchase Skew : 20.75
Removing one outlier from AvgPerPurchase.
#chained indexing (data[mask]['AvgPerPurchase']=...) would silently assign to a copy; use .loc instead
data.loc[data['AvgPerPurchase'] > 500, 'AvgPerPurchase']=data['AvgPerPurchase'].mean()
data[data['AvgPerPurchase']>500]['AvgPerPurchase']
Series([], Name: AvgPerPurchase, dtype: float64)
#data[cols_with_accepted].corr()>0.5
print('Skew :', round(data['AvgPerPurchase'].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
data['AvgPerPurchase'].hist(bins = 10, grid = False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = data['AvgPerPurchase'])
plt.show()
Skew : 1.3
What are the most important observations and insights from the data, based on the EDA and data preprocessing performed?
Separating out behavior columns: cols_with_Mnt and cols_with_accepted are already subsets of behavior. To these we'll add Recency, all Num*Purchases columns, NumWebVisitsMonth, and Complain, plus the new columns AvgPerPurchase, TotSpent, and TotAccepted.
#data[cols_with_accepted].corr()>0.5
cols_behavior = cols_with_Mnt+cols_with_accepted+[col for col in data.columns if "Num" in col]+['TotAccepted','Complain','AvgPerPurchase','TotSpent', 'Recency']
print(cols_behavior)
['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'TotNumPurchased', 'TotAccepted', 'Complain', 'AvgPerPurchase', 'TotSpent', 'Recency']
We also need demographic columns: TotDependents, Age, FamilySize,Education, Marital_Status, Income
data_dums=pd.get_dummies(data, drop_first=True)
cols_demo=['TotDependents', 'Age', 'FamilySize', 'Income']+[cols for cols in data_dums.columns if "Status" in cols]+[cols for cols in data_dums.columns if "Education" in cols]
cols_demo_acc=['TotDependents', 'Age', 'FamilySize', 'Income','TotAccepted', "NumDealsPurchases"]+[cols for cols in data_dums.columns if "Status" in cols]+[cols for cols in data_dums.columns if "Education" in cols]
data_behavior=data[cols_behavior]
data_demo=data_dums[cols_demo]
data_demo_acc=data_dums[cols_demo_acc]
data_demo_acc.head()
| TotDependents | Age | FamilySize | Income | TotAccepted | NumDealsPurchases | Marital_Status_Married | Marital_Status_Single | Marital_Status_Together | Marital_Status_Widow | Education_Basic | Education_Graduation | Education_Master | Education_PhD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 59.0 | 2 | 58138.0 | 1 | 3 | False | True | False | False | False | True | False | False |
| 1 | 2 | 62.0 | 4 | 46344.0 | 0 | 2 | False | True | False | False | False | True | False | False |
| 2 | 0 | 51.0 | 2 | 71613.0 | 0 | 1 | False | False | True | False | False | True | False | False |
| 3 | 1 | 32.0 | 3 | 26646.0 | 0 | 2 | False | False | True | False | False | True | False | False |
| 4 | 1 | 35.0 | 3 | 58293.0 | 0 | 5 | True | False | False | False | False | False | False | True |
plt.figure(figsize = (20, 20))
sns.heatmap(data_behavior.corr(), annot = True, cmap = "YlGnBu")
plt.show()
Looking at all the behavior columns, there are some notably strong correlations:
From this, we conclude that wine and meat were more expensive than other products and that more of these items were bought through the catalog than through other channels.
Meat purchases were made mostly through the catalog, while the total number of purchases did not vary strongly by channel (web, catalog, or store).
There is very little substitution effect: big shoppers in one category were also big shoppers in the others.
On the other hand, AcceptedCmp2, AcceptedCmp3, Recency, and Complain correlate poorly with everything else, so we will drop them.
data_behavior.drop(columns=['AcceptedCmp2', 'AcceptedCmp3', 'Recency','Complain'], inplace=True)
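The columns above were chosen by eye from the heatmap. For illustration, a sketch of doing the same programmatically (the `weakly_correlated` helper and the 0.25 threshold are hypothetical, not part of the notebook):

```python
import numpy as np
import pandas as pd

def weakly_correlated(df: pd.DataFrame, threshold: float = 0.25) -> list:
    """Columns whose strongest absolute correlation with any other
    column is below `threshold` (the threshold is a judgment call)."""
    corr = df.corr().abs()
    np.fill_diagonal(corr.values, 0)   # ignore self-correlation
    return corr.columns[corr.max() < threshold].tolist()

# Tiny demonstration: z is pure noise, x and y move together
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({"x": x,
                     "y": x + rng.normal(scale=0.1, size=500),
                     "z": rng.normal(size=500)})
print(weakly_correlated(demo))  # expected: ['z'] for this seed
```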
data_demo_acc.head()
| TotDependents | Age | FamilySize | Income | TotAccepted | NumDealsPurchases | Marital_Status_Married | Marital_Status_Single | Marital_Status_Together | Marital_Status_Widow | Education_Basic | Education_Graduation | Education_Master | Education_PhD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 59.0 | 2 | 58138.0 | 1 | 3 | False | True | False | False | False | True | False | False |
| 1 | 2 | 62.0 | 4 | 46344.0 | 0 | 2 | False | True | False | False | False | True | False | False |
| 2 | 0 | 51.0 | 2 | 71613.0 | 0 | 1 | False | False | True | False | False | True | False | False |
| 3 | 1 | 32.0 | 3 | 26646.0 | 0 | 2 | False | False | True | False | False | True | False | False |
| 4 | 1 | 35.0 | 3 | 58293.0 | 0 | 5 | True | False | False | False | False | False | False | True |
plt.figure(figsize = (20, 20))
sns.heatmap(data_demo_acc.corr(), annot = True, cmap = "YlGnBu")
plt.show()
scaler=StandardScaler()
data_behavior_scaled=pd.DataFrame(scaler.fit_transform(data_behavior), columns=data_behavior.columns)
data_behavior_scaled.head()
| MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | Response | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | TotNumPurchased | TotAccepted | AvgPerPurchase | TotSpent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.983781 | 1.551577 | 1.679702 | 2.462147 | 1.476500 | 0.843207 | -0.28383 | -0.28014 | -0.262111 | 2.388846 | 0.349414 | 1.409304 | 2.510890 | -0.550785 | 0.693904 | 1.313544 | 0.623174 | 1.201453 | 1.679417 |
| 1 | -0.870479 | -0.636301 | -0.713225 | -0.650449 | -0.631503 | -0.729006 | -0.28383 | -0.28014 | -0.262111 | -0.418612 | -0.168236 | -1.110409 | -0.568720 | -1.166125 | -0.130463 | -1.185022 | -0.502560 | -1.018283 | -0.961275 |
| 2 | 0.362723 | 0.570804 | -0.177032 | 1.345274 | -0.146905 | -0.038766 | -0.28383 | -0.28014 | -0.262111 | -0.418612 | -0.685887 | 1.409304 | -0.226541 | 1.295237 | -0.542647 | 1.035926 | -0.502560 | 0.047523 | 0.282673 |
| 3 | -0.870479 | -0.560857 | -0.651187 | -0.503974 | -0.583043 | -0.748179 | -0.28383 | -0.28014 | -0.262111 | -0.418612 | -0.168236 | -0.750450 | -0.910898 | -0.550785 | 0.281720 | -0.907403 | -0.502560 | -0.949003 | -0.918094 |
| 4 | -0.389085 | 0.419916 | -0.216914 | 0.155164 | -0.001525 | -0.556446 | -0.28383 | -0.28014 | -0.262111 | -0.418612 | 1.384715 | 0.329427 | 0.115638 | 0.064556 | -0.130463 | 0.203070 | -0.502560 | -0.240366 | -0.305254 |
data_behavior_scaled.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| MntWines | 2240.0 | -7.612958e-17 | 1.000223 | -0.903167 | -0.832592 | -0.387599 | 0.595248 | 3.533390 |
| MntFruits | 2240.0 | -2.379049e-17 | 1.000223 | -0.661449 | -0.636301 | -0.460265 | 0.168436 | 4.343008 |
| MntMeatProducts | 2240.0 | 4.123686e-17 | 1.000223 | -0.739813 | -0.668912 | -0.442913 | 0.288259 | 6.904261 |
| MntFishProducts | 2240.0 | 1.506731e-17 | 1.000223 | -0.687068 | -0.632140 | -0.467355 | 0.228401 | 4.055064 |
| MntSweetProducts | 2240.0 | -1.506731e-17 | 1.000223 | -0.655733 | -0.631503 | -0.461894 | 0.143854 | 5.716737 |
| MntGoldProds | 2240.0 | -1.110223e-17 | 1.000223 | -0.844046 | -0.671486 | -0.383886 | 0.229660 | 6.096698 |
| AcceptedCmp4 | 2240.0 | 5.551115e-17 | 1.000223 | -0.283830 | -0.283830 | -0.283830 | -0.283830 | 3.523233 |
| AcceptedCmp5 | 2240.0 | -3.013462e-17 | 1.000223 | -0.280140 | -0.280140 | -0.280140 | -0.280140 | 3.569640 |
| AcceptedCmp1 | 2240.0 | 6.264830e-17 | 1.000223 | -0.262111 | -0.262111 | -0.262111 | -0.262111 | 3.815174 |
| Response | 2240.0 | -7.295751e-17 | 1.000223 | -0.418612 | -0.418612 | -0.418612 | -0.418612 | 2.388846 |
| NumDealsPurchases | 2240.0 | -9.833404e-17 | 1.000223 | -1.203537 | -0.685887 | -0.168236 | 0.349414 | 6.561217 |
| NumWebPurchases | 2240.0 | -5.709718e-17 | 1.000223 | -1.470368 | -0.750450 | -0.030532 | 0.689386 | 8.248526 |
| NumCatalogPurchases | 2240.0 | 9.516197e-18 | 1.000223 | -0.910898 | -0.910898 | -0.226541 | 0.457817 | 8.670110 |
| NumStorePurchases | 2240.0 | 7.612958e-17 | 1.000223 | -1.781466 | -0.858455 | -0.243114 | 0.679896 | 2.218248 |
| NumWebVisitsMonth | 2240.0 | -6.344132e-17 | 1.000223 | -2.191381 | -0.954831 | 0.281720 | 0.693904 | 6.052291 |
| TotNumPurchased | 2240.0 | -1.118153e-16 | 1.000223 | -1.740259 | -0.907403 | -0.074548 | 0.758307 | 2.701637 |
| TotAccepted | 2240.0 | -2.537653e-17 | 1.000223 | -0.502560 | -0.502560 | -0.502560 | 0.623174 | 5.126107 |
| AvgPerPurchase | 2240.0 | -1.110223e-16 | 1.000223 | -1.242751 | -0.810443 | -0.256993 | 0.389004 | 4.998005 |
| TotSpent | 2240.0 | 3.647876e-17 | 1.000223 | -0.997813 | -0.891937 | -0.348436 | 0.730262 | 3.187435 |
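The describe() table confirms that every scaled column has mean ≈ 0 and std ≈ 1 (StandardScaler uses the population standard deviation). A minimal sanity-check sketch on a toy array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column data; values are arbitrary
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Xs = StandardScaler().fit_transform(X)

# z-scoring produces zero mean and unit (population) std
print(round(abs(Xs.mean()), 6), round(Xs.std(), 6))  # → 0.0 1.0
```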
# Creating copy of the data to store labels from each algorithm
data_behavior_scaled_copy = data_behavior_scaled.copy(deep = True)
# Defining the number of principal components to generate
n = data_behavior_scaled.shape[1]
print("Max number of principal components", n)
# Finding principal components for the data
cols=['PC1','PC2']
pca1 = PCA(n_components = n, random_state = 1)
data_pca = pd.DataFrame(pca1.fit_transform(data_behavior_scaled))
#transformed_data=pd.DataFrame(data_pca, columns=cols)
# The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_
Max number of principal components 19
# Visualize the explained variance by individual components
plt.figure(figsize = (10, 10))
plt.plot(range(1, n+1), pca1.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()
# Find the smallest number of components that explain more than 70% of the variance
total = 0    # renamed from `sum` to avoid shadowing the built-in
for ix, i in enumerate(exp_var1):
    total = total + i
    print(i, total)
    if total > 0.70:
        print("Number of PCs that explain at least 70% variance: ", ix + 1)
        break
0.43315864706841356 0.43315864706841356
0.12136306680420306 0.5545217138726166
0.09061677039937804 0.6451384842719947
0.05372784202593959 0.6988663262979343
0.039727082771396215 0.7385934090693306
Number of PCs that explain at least 70% variance:  5
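The same count can be obtained in one line with NumPy. The ratios below are illustrative stand-ins for `pca1.explained_variance_ratio_`:

```python
import numpy as np

# Hypothetical explained-variance ratios, rounded from the printout above
exp_var = np.array([0.433, 0.121, 0.091, 0.054, 0.040, 0.030])

# argmax returns the index of the first True, i.e. the first component
# at which the cumulative explained variance exceeds 70%
n_pcs = int(np.argmax(np.cumsum(exp_var) > 0.70)) + 1
print(n_pcs)  # → 5
```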
colsall = []
for i in range(1, ix + 2):
    colsall.append("PC" + str(i))
print(colsall)
['PC1', 'PC2', 'PC3', 'PC4', 'PC5']
pc1 = pd.DataFrame(np.round(pca1.components_.T[:, 0:ix+1], 2), index = data_behavior_scaled.columns, columns = colsall)
pc1
| PC1 | PC2 | PC3 | PC4 | PC5 | |
|---|---|---|---|---|---|
| MntWines | 0.29 | 0.08 | 0.18 | 0.25 | 0.13 |
| MntFruits | 0.23 | -0.18 | -0.17 | -0.21 | -0.15 |
| MntMeatProducts | 0.28 | -0.06 | -0.17 | -0.02 | 0.40 |
| MntFishProducts | 0.24 | -0.18 | -0.18 | -0.20 | -0.17 |
| MntSweetProducts | 0.23 | -0.16 | -0.15 | -0.19 | -0.29 |
| MntGoldProds | 0.20 | -0.12 | 0.11 | -0.32 | -0.21 |
| AcceptedCmp4 | 0.10 | 0.36 | 0.17 | 0.48 | -0.19 |
| AcceptedCmp5 | 0.18 | 0.35 | -0.14 | 0.11 | 0.02 |
| AcceptedCmp1 | 0.16 | 0.32 | -0.10 | -0.09 | -0.40 |
| Response | 0.11 | 0.40 | 0.03 | -0.51 | 0.23 |
| NumDealsPurchases | -0.03 | -0.06 | 0.58 | -0.20 | 0.31 |
| NumWebPurchases | 0.20 | -0.09 | 0.45 | -0.07 | -0.31 |
| NumCatalogPurchases | 0.28 | -0.07 | -0.01 | 0.01 | 0.30 |
| NumStorePurchases | 0.25 | -0.18 | 0.19 | 0.26 | -0.15 |
| NumWebVisitsMonth | -0.20 | 0.14 | 0.38 | -0.25 | -0.08 |
| TotNumPurchased | 0.30 | -0.15 | 0.25 | 0.10 | -0.07 |
| TotAccepted | 0.19 | 0.53 | 0.00 | -0.12 | -0.06 |
| AvgPerPurchase | 0.30 | 0.02 | -0.04 | 0.01 | 0.24 |
| TotSpent | 0.34 | -0.02 | 0.01 | 0.06 | 0.16 |
Let's look at a plot of the first two of these:
pca2 = PCA(n_components = 2, random_state = 2)
data_pca = pca2.fit_transform(data_behavior_scaled)
transformed_data = pd.DataFrame(data_pca, columns = cols)
sns.scatterplot(transformed_data, x='PC1', y='PC2')
<Axes: xlabel='PC1', ylabel='PC2'>
We note that PCA's first two components - comprising about 43% and 12% of the variance respectively, out of the roughly 74% explained by the first five - do produce interesting patterns that we expect to come out in our clustering algorithms.
Think about it:
Observation and Insights:
First PCA dimension: largely centered around how much the customer spends overall.
Second PCA dimension: mostly relates to how targetable the customer is by marketing campaigns - obviously a highly useful dimension to know.
Third PCA dimension: largely about website usage and responsiveness to deals.
The other PCA dimensions don't have an easily interpretable meaning beyond specific products being bought (wine vs. meat, respectively).
Knowing this, we can start to make sense of our plot above. We note a gap in the lower-left corner of the plot for the first 2 PCA components. This corner would represent customers who are small spenders but are also targetable by marketing campaigns. The emptiness of this corner reflects that these customers are not discretionary spenders and can't be recruited by marketing campaigns. The straight lines in this part of the chart reflect that response to campaigns is strongly limited by disposable income.
# Trying various values of perplexity
# (commented out to speed up reruns)
'''for i in range(5, 65, 7):
    tsne = TSNE(n_components = 2, random_state = 2, perplexity = i)
    data_tsne = tsne.fit_transform(data_behavior_scaled)
    data_tsne = pd.DataFrame(data_tsne)
    data_tsne.columns = ['X1','X2']
    plt.figure(figsize = (7,7))
    sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
    plt.title("perplexity = {}".format(i))'''
Observation and Insights: Most t-SNE runs at perplexity > 33 return a similar clustering topology: one large "continent" at lower X1 values (bounded by X2 = -5/2*X1 + 80) and a chain of perhaps 3 main islands at larger X1. Let's stick with perplexity = 33.
tsne = TSNE(n_components = 2, random_state = 2, perplexity = 33)
data_tsne = tsne.fit_transform(data_behavior_scaled)
data_tsne = pd.DataFrame(data_tsne)
data_tsne.columns = ['X1','X2']
plt.figure(figsize = (7,7))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
plt.title("perplexity = {}".format(33))
Text(0.5, 1.0, 'perplexity = 33')
# Let's assign points to 4 different groups based on their t-SNE coordinates
def grouping(x):
    first_component = x['X1']
    second_component = x['X2']
    if (second_component < -5/2*first_component + 80):
        return 'group 1'
    elif first_component < 40:
        return 'group_2'
    elif (second_component < -15):
        return 'group_3'
    else:
        return 'group_4'
data_tsne['groups'] = data_tsne.apply(grouping, axis = 1)
# Scatter plot for two components with hue
plt.figure(figsize = (7, 7))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue = 'groups')
<Axes: xlabel='X1', ylabel='X2'>
#data_tsne.head()
data_tsne['groups'].value_counts()
groups
group 1    1708
group_4     259
group_2     206
group_3      67
Name: count, dtype: int64
This highly uneven split of groups is not ideal. Group 1, for instance, contains 76% of all customers.
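The imbalance can be quantified directly with `value_counts(normalize=True)`; the counts below are the ones reported above:

```python
import pandas as pd

# Rebuild a label series with the group sizes from the printout above
groups = pd.Series(["group 1"] * 1708 + ["group_4"] * 259
                   + ["group_2"] * 206 + ["group_3"] * 67)

# normalize=True converts counts to shares, making imbalance obvious
shares = groups.value_counts(normalize=True)
print(f"{shares['group 1']:.4f}")  # → 0.7625
```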
all_col = data_behavior_scaled.columns[:].tolist()
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_behavior_scaled[variable], x=data_tsne['groups'])
    plt.tight_layout()
    plt.title(variable)
plt.show()
# importing plotly
import plotly.express as px
all_col = data_demo.columns[:].tolist()
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_demo[variable], x=data_tsne['groups'])
    #px.box(data_demo, y=data_demo[variable], x=data_tsne['groups'], points="all")
    plt.tight_layout()
    plt.title(variable)
plt.show()
The strongest separation seems to happen by number of web visits:
Think About It:
# Empty dictionary to store the SSE for each value of K
sse = {}
# Iterate for a range of Ks and fit the scaled data to the algorithm.
# Use inertia attribute from the clustering object and store the inertia value for that K
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
    sse[k] = kmeans.inertia_
# Elbow plot
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
Given the sharp elbow at k=3, 3 clusters seems ideal.
# Empty dictionary to store the Silhouette score for each value of K
sc = {}
# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K
for k in range(2, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
    labels = kmeans.predict(data_pca)
    sc[k] = silhouette_score(data_pca, labels)
    print(silhouette_score(data_pca, labels))
# Silhouette score plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.show()
0.5823348058088715
0.5961512301509109
0.5185893378108656
0.5147963464217481
0.5336193717245035
0.5003371759584443
0.5374491442749475
0.5410690201481569
k=3 gives the highest silhouette score (0.596), so k=3 is our best clustering for KMeans.
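The best k can also be read off the `sc` dictionary directly rather than by eye; a minimal sketch with illustrative values copied from the printout above:

```python
# {k: silhouette score} pairs, illustrative values from the run above
sc = {2: 0.582, 3: 0.596, 4: 0.519, 5: 0.515}

# max over keys, ordered by their silhouette score
best_k = max(sc, key=sc.get)
print(best_k)  # → 3
```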
cols = ['PC1', 'PC2', 'PC3']
pca2 = PCA(n_components = 3, random_state = 1)
data_pca = pca2.fit_transform(data_behavior_scaled)
transformed_data = pd.DataFrame(data_pca, columns = cols)
transformed_data_copy = data_behavior_scaled.copy(deep = True)
kmeans = KMeans(n_clusters = 3, random_state = 1)
kmeans_trans = KMeans(n_clusters = 3, random_state = 1)
kmeans.fit(data_behavior_scaled)
kmeans_trans.fit(transformed_data)
# Adding predicted labels to the original data and the scaled data
data_behavior_scaled_copy['KMeans_Labels'] = kmeans.predict(data_behavior_scaled)
transformed_data_copy['KMeans_Labels'] = kmeans_trans.predict(transformed_data)
data['KMeans_Labels'] = kmeans.predict(data_behavior_scaled)
transformed_data['KMeans_Labels'] = kmeans_trans.predict(transformed_data)
data['KMeans_Labels'].value_counts()
KMeans_Labels
0    1217
2     803
1     220
Name: count, dtype: int64
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
<Axes: xlabel='PC1', ylabel='PC2'>
all_col = data_behavior_scaled.columns[:].tolist()
#plt.figure(figsize = (30, 50))
'''for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_behavior_scaled[variable], x=transformed_data['KMeans_Labels'])
    plt.tight_layout()
    plt.title(variable)
plt.show()'''
sns.boxplot(y=data_behavior_scaled['NumDealsPurchases'], x=transformed_data['KMeans_Labels'])
<Axes: xlabel='KMeans_Labels', ylabel='NumDealsPurchases'>
all_col = data_demo.columns[:].tolist()
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_demo[variable], x=transformed_data['KMeans_Labels'])
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations and Insights:
k=3 provided 3 clusters of data, but groups 0 and 1 don't split graphically as expected. The most conspicuous split is by income: Groups 0, 1, and 2 are low, medium, and high-income, respectively.
Think About It:
Summary of each cluster: We see a grouping similar to what t-SNE produced, in that groups are most strongly stratified by web visits and one group responds most strongly to campaigns.
kmedo = KMedoids(n_clusters = 3, random_state = 5)
kmedo.fit(transformed_data)
data_behavior_scaled_copy['kmedoLabels'] = kmedo.predict(transformed_data)
data['kmedoLabels'] = kmedo.predict(transformed_data)
transformed_data['kmedoLabels']=kmedo.predict(transformed_data)
data['KMeans_Labels'].value_counts()
KMeans_Labels
0    1217
2     803
1     220
Name: count, dtype: int64
data.kmedoLabels.value_counts()
kmedoLabels
2    1079
1     582
0     579
Name: count, dtype: int64
Clusters are a bit more balanced than with KMeans.
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='kmedoLabels', palette='Dark2')
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.show()
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
<Axes: xlabel='PC1', ylabel='PC2'>
Similar clustering as before, mostly by income and amount spent, but with KMedoids the left-most group grew from 220 to 1079, the right-most group shrank from 803 to 582, and the middle group shrank from 1217 to 579.
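To quantify "similar clustering" beyond eyeballing the scatter plots, one could compare the two label columns with the adjusted Rand index, e.g. `adjusted_rand_score(data['KMeans_Labels'], data['kmedoLabels'])` (a supplement, not part of the original notebook). The index ignores label names and scores only how the partitions agree:

```python
from sklearn.metrics import adjusted_rand_score

# Two toy labelings: identical partitions up to a relabeling score 1.0;
# unrelated partitions would score near 0
a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]   # same grouping, different label names
print(adjusted_rand_score(a, b))  # → 1.0
```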
# Empty dictionary to store the Silhouette score for each value of K
sc = {}
# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K
for k in range(2, 10):
    kmeds = KMedoids(n_clusters = k, random_state = 3).fit(data_pca)
    labels = kmeds.predict(data_pca)
    sc[k] = silhouette_score(data_pca, labels)
    print(silhouette_score(data_pca, labels))
# Silhouette score plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.show()
0.4889265476721816
0.43059969439048595
0.2960122103355577
0.2982684918570989
0.3275626390263029
0.28114812409759826
0.2724570264059652
0.2910804108671857
for col in data_behavior.columns:
    sns.boxplot(x = 'kmedoLabels', y = col, data = data)
    plt.show()
all_col = data_demo.columns[:-4].tolist()
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    sns.boxplot(y=data_demo[variable], x=transformed_data['kmedoLabels'])
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations and Insights:
Summary for each cluster:
Observations and Insights:
from scipy.cluster.hierarchy import dendrogram, linkage
# The List of all linkage methods to check
methods = ['single',
           'average',
           'complete']
# Create a subplot image
fig, axs = plt.subplots(len(methods), 1, figsize = (20, 15))
# Enumerate through the list of all methods above, get linkage and plot dendrogram
for i, method in enumerate(methods):
    Z = linkage(transformed_data, metric = 'euclidean', method = method)
    dendrogram(Z, ax = axs[i]);
    axs[i].set_title(f'Dendrogram ({method.capitalize()} Linkage)')
    axs[i].set_ylabel('Distance')
Think about it:
Single linkage does not cluster well in this case.
Complete linkage does an okay job, but average linkage gives the best cluster separation. Cutting at a distance of 5 gives us 4 clusters.
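Cutting the tree at a chosen height can also be done programmatically with scipy's `fcluster`; a toy sketch (the data and cut height here are illustrative, not the notebook's):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy blobs; any cut between the blob diameter and
# the blob separation recovers exactly 2 flat clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)),
               rng.normal(10, 0.1, (5, 2))])
Z = linkage(X, method="average", metric="euclidean")

# criterion="distance" cuts the dendrogram at height t
labels = fcluster(Z, t=5, criterion="distance")
print(len(np.unique(labels)))  # → 2
```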
plt.figure(figsize = (20, 7))
plt.title("Dendrograms")
dend = dendrogram(linkage(transformed_data, method = 'average'))
plt.axhline(y = 5, color = 'g', linestyle = '--')
<matplotlib.lines.Line2D at 0x788b63f8f550>
# Empty dictionary to store the Silhouette score for each value of K
sc = {}
# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K
for k in range(2, 10):
    hierarchical = AgglomerativeClustering(n_clusters = k, affinity = 'euclidean', linkage = 'average')
    hierarchical.fit(transformed_data)
    labels = hierarchical.labels_
    sc[k] = silhouette_score(data_pca, labels)
    print(silhouette_score(data_pca, labels))
# Silhouette score plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.show()
0.473625696192175
0.3605942512311247
0.47212100064634444
0.43642396685740936
0.42610924696389463
0.4340967897476519
0.43259411498859485
0.4177619564849529
# Clustering with 4 clusters
hierarchical = AgglomerativeClustering(n_clusters = 4, affinity = 'euclidean', linkage = 'average')
hierarchical.fit(transformed_data)
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=4)
data_behavior_scaled_copy['HCLabels'] = hierarchical.labels_
data_dums['HCLabels'] = hierarchical.labels_
data['HCLabels'] = hierarchical.labels_
transformed_data['HCLabels'] = hierarchical.labels_
data_demo['HCLabels'] = hierarchical.labels_
data.HCLabels.value_counts()
HCLabels
3    1077
1     927
0     233
2       3
Name: count, dtype: int64
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='HCLabels', palette='Dark2')
plt.show()
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='HCLabels', palette='Dark2')
<Axes: xlabel='PC1', ylabel='PC2'>
sns.scatterplot(y='PC1', x='PC3', data=transformed_data, hue='HCLabels', palette='Dark2')
<Axes: xlabel='PC3', ylabel='PC1'>
Results are very similar to KMeans and KMedoids, in that we are mostly stratifying by how much consumers spend. HC also isolated 3 unusual customers into their own cluster, which we're calling Group 2.
for col in data_behavior.columns:
    sns.boxplot(x = 'HCLabels', y = col, data = data)
    plt.show()
mean = round(data_dums.groupby('HCLabels').mean(), 2)
median = round(data_dums.groupby('HCLabels').median(), 2)
df_hierachical = pd.concat([mean, median], axis = 0)
df_hierachical.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_3 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median', 'group_3 Median']
df_hierachical[data_behavior.columns].T
| group_0 Mean | group_1 Mean | group_2 Mean | group_3 Mean | group_0 Median | group_1 Median | group_2 Median | group_3 Median | |
|---|---|---|---|---|---|---|---|---|
| MntWines | 833.01 | 470.78 | 26.33 | 46.64 | 847.00 | 418.00 | 5.0 | 23.00 |
| MntFruits | 62.64 | 42.03 | 6.33 | 4.96 | 44.00 | 26.00 | 1.0 | 2.00 |
| MntMeatProducts | 495.08 | 251.17 | 23.33 | 23.87 | 490.00 | 183.00 | 3.0 | 15.00 |
| MntFishProducts | 88.24 | 60.06 | 3.00 | 7.25 | 69.00 | 39.00 | 1.0 | 3.00 |
| MntSweetProducts | 68.20 | 42.29 | 4.67 | 5.12 | 51.00 | 27.00 | 1.0 | 2.00 |
| MntGoldProds | 74.97 | 69.64 | 1.33 | 15.40 | 55.00 | 50.00 | 1.0 | 10.00 |
| AcceptedCmp4 | 0.36 | 0.07 | 0.00 | 0.01 | 0.00 | 0.00 | 0.0 | 0.00 |
| AcceptedCmp5 | 0.65 | 0.01 | 0.00 | 0.00 | 1.00 | 0.00 | 0.0 | 0.00 |
| AcceptedCmp1 | 0.51 | 0.03 | 0.00 | 0.00 | 1.00 | 0.00 | 0.0 | 0.00 |
| Response | 0.60 | 0.11 | 0.00 | 0.08 | 1.00 | 0.00 | 0.0 | 0.00 |
| NumDealsPurchases | 1.17 | 2.91 | 15.00 | 2.03 | 1.00 | 2.00 | 15.0 | 2.00 |
| NumWebPurchases | 5.40 | 6.00 | 0.00 | 2.16 | 5.00 | 6.00 | 0.0 | 2.00 |
| NumCatalogPurchases | 5.97 | 4.26 | 0.33 | 0.58 | 6.00 | 4.00 | 0.0 | 0.00 |
| NumStorePurchases | 8.02 | 8.13 | 0.00 | 3.31 | 8.00 | 8.00 | 0.0 | 3.00 |
| NumWebVisitsMonth | 3.17 | 4.63 | 19.33 | 6.34 | 3.00 | 5.00 | 19.0 | 7.00 |
| TotNumPurchased | 19.39 | 18.38 | 0.33 | 6.06 | 19.00 | 18.00 | 0.0 | 6.00 |
| TotAccepted | 2.37 | 0.29 | 0.00 | 0.16 | 2.00 | 0.00 | 0.0 | 0.00 |
| AvgPerPurchase | 87.18 | 51.00 | 59.33 | 14.81 | 82.11 | 42.33 | 0.0 | 12.67 |
| TotSpent | 1622.14 | 935.97 | 65.00 | 103.24 | 1650.00 | 908.00 | 9.0 | 66.00 |
Group 2 (n=3): Deal David. The 3 people in Group 2 are online window shoppers waiting for a deal. David is 5 times as likely to respond to a deal as anyone in any other category, and while he spends a lot less overall, his average spend per purchase is higher than almost anyone else's.
Group 0 (n=233, 10%): Heavy Spender Henry. He buys more of every product than anyone else, and he is very targetable with campaigns.
Group 1 (n=927, 41%): Economical Esther. She is a bit thrifty: she spends more than Deal David from Group 2 and makes nearly as many purchases as Heavy Spender Henry, but tends toward cheaper purchases in virtually every category. Goes for deals.
Group 3 (n=1077, 48%): the largest group - our Average Joe. Joe does not respond to campaigns and does not buy many items or spend very much. Joe should be a last priority in marketing campaigns.
for col in data_demo.columns:
    sns.boxplot(x = 'HCLabels', y = col, data = data_demo)
    plt.show()
Observations and Insights:
Summary of each cluster:
DBSCAN is a very powerful algorithm for finding high-density clusters, but the hard part is choosing its two main hyperparameters, eps and min_samples.
Since clustering is unsupervised, there is no validation set to score against, unlike in supervised learning. The practical approach is to try a range of hyperparameter combinations and compare the silhouette score for each.
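One common heuristic for narrowing down eps before a grid search is the k-distance curve: sort each point's distance to its k-th nearest neighbor and look for the "knee" of the sorted curve. This is a supplement, not what the notebook does; the toy data below stands in for transformed_data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D data as a stand-in for the real feature matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))

k = 5
# +1 because each query point is returned as its own 0-th neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)

# Sorted k-th neighbor distances; the knee of this curve suggests eps
kth = np.sort(dists[:, -1])
print(kth.shape)  # → (100,)
```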
# Empty dictionary to store the silhouette score for each value of eps
sc = {}
# Iterate over a range of eps values and store the silhouette score for each
for eps in np.arange(0.5, 5.5, 0.5):
    dbs = DBSCAN(eps=eps)
    labels = dbs.fit_predict(transformed_data)
    #print("no of labels", len(np.unique(labels)))
    if len(np.unique(labels)) > 1:  # silhouette needs at least 2 distinct labels
        sc[eps] = silhouette_score(transformed_data, labels)
# Elbow plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("Eps")
plt.ylabel("Silhouette Score")
plt.show()
Eps=1.5 seems to work best, though anything in the range 1.5-2.5 might be plausible, in case we've hit a saddle point. It might be that eps=1.5 is solely due to default min_samples choice. Let's tune min_samples similarly.
# Empty dictionary to store the silhouette score for each value of min_samples
sc = {}
# Iterate over a range of min_samples values and store the silhouette score for each
for min_samples in range(2, 20):
    dbs = DBSCAN(eps=1.5, min_samples=min_samples)
    labels = dbs.fit_predict(transformed_data)
    #print("no of labels", len(np.unique(labels)))
    if len(np.unique(labels)) > 1:  # escaping errors when too few clusters form
        sc[min_samples] = silhouette_score(transformed_data, labels)
        print(sc[min_samples], " with min_samples", min_samples)
# Elbow plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("min_samples")
plt.ylabel("Silhouette Score")
plt.show()
0.5248019274845735  with min_samples 2
0.5452855689759257  with min_samples 3
0.5447514240850904  with min_samples 4
0.5432579128696696  with min_samples 5
0.5412803283855557  with min_samples 6
0.5410223044310082  with min_samples 7
0.5404628434072999  with min_samples 8
0.5391239800276781  with min_samples 9
0.5390779377205508  with min_samples 10
0.5387896586947369  with min_samples 11
0.5374682421861766  with min_samples 12
0.5373716247061795  with min_samples 13
0.5048655837504044  with min_samples 14
0.530167407901859  with min_samples 15
0.5296913972273695  with min_samples 16
0.5283344399614528  with min_samples 17
0.5203236807002284  with min_samples 18
0.5201479190091234  with min_samples 19
min_samples=5 (which is the default) works well (silhouette 0.543). min_samples=3 scores slightly higher (0.545) but does not change the clustering's appearance (see below), and might be an overfit.
For eps = 2.5, min_samples=20 works better (silhouette ≈ 0.56); for eps = 2, min_samples=5 again works best (silhouette ≈ 0.56).
For all values of eps between 1.5 and 2.5, we have checked the robustness of the resulting clustering.
dbs = DBSCAN(eps = 1.5, min_samples=5)
data_behavior_scaled_copy['DBSLabels'] = dbs.fit_predict(transformed_data)
data_dums['DBSLabels'] = dbs.fit_predict(transformed_data)
data['DBSLabels'] = dbs.fit_predict(transformed_data)
transformed_data['DBSLabels']=dbs.fit_predict(transformed_data)
data_demo['DBSLabels'] = dbs.fit_predict(transformed_data)
data['DBSLabels'].value_counts()
DBSLabels 1 1077 0 922 2 228 -1 13 Name: count, dtype: int64
Coincidentally, we also arrive at 4 labels, although -1 is DBSCAN's reserved noise label (outliers) rather than a true cluster. Let's see how it looks on the first two PCA components.
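To see what the -1 label means in isolation, a tiny sketch (toy data and hypothetical parameters, not the notebook's):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Label -1 is DBSCAN's noise bucket, not a cluster: points with too few
# neighbors within eps. Toy data: two tight blobs plus one far outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)),
               rng.normal(5, 0.1, (20, 2)),
               [[100.0, 100.0]]])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# Two clusters plus one noise point
print(sorted(set(labels.tolist())), int((labels == -1).sum()))  # → [-1, 0, 1] 1
```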
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='DBSLabels', palette='Dark2')
plt.show()
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='DBSLabels', palette='Dark2')
<Axes: xlabel='PC1', ylabel='PC2'>
Observations and Insights: Similar clustering to before, except that DBSCAN assigns 13 customers the label -1, which marks them as noise: points that are not density-reachable from any cluster. Because this plot shows only the first two PCA dimensions, we may be missing structure in later dimensions that DBSCAN picks up when flagging these points.
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data, hue='DBSLabels', palette='Dark2')
plt.show()
PC3 also shows a similar scattering of this -1 category that does not admit simple explanation.
mean = round(data_dums.groupby('DBSLabels').mean(), 2)
median = round(data_dums.groupby('DBSLabels').median(), 2)
df_hierachical = pd.concat([mean, median], axis = 0)
df_hierachical.index = ['group_-1 Mean', 'group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_-1 Median', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_hierachical[data_behavior.columns].T
| group_-1 Mean | group_0 Mean | group_1 Mean | group_2 Mean | group_-1 Median | group_0 Median | group_1 Median | group_2 Median | |
|---|---|---|---|---|---|---|---|---|
| MntWines | 595.46 | 468.15 | 46.64 | 838.64 | 408.00 | 415.50 | 23.00 | 851.50 |
| MntFruits | 56.23 | 42.03 | 4.96 | 61.82 | 14.00 | 26.00 | 2.00 | 43.50 |
| MntMeatProducts | 359.00 | 248.80 | 23.87 | 500.89 | 161.00 | 181.50 | 15.00 | 500.00 |
| MntFishProducts | 51.23 | 60.03 | 7.25 | 88.74 | 12.00 | 39.00 | 3.00 | 69.00 |
| MntSweetProducts | 46.69 | 42.35 | 5.12 | 67.78 | 12.00 | 26.50 | 2.00 | 51.00 |
| MntGoldProds | 35.15 | 69.64 | 15.40 | 76.13 | 22.00 | 50.00 | 10.00 | 56.50 |
| AcceptedCmp4 | 0.31 | 0.07 | 0.01 | 0.36 | 0.00 | 0.00 | 0.00 | 0.00 |
| AcceptedCmp5 | 0.31 | 0.01 | 0.00 | 0.66 | 0.00 | 0.00 | 0.00 | 1.00 |
| AcceptedCmp1 | 0.46 | 0.03 | 0.00 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 |
| Response | 0.62 | 0.11 | 0.08 | 0.59 | 1.00 | 0.00 | 0.00 | 1.00 |
| NumDealsPurchases | 5.31 | 2.91 | 2.03 | 1.15 | 2.00 | 2.00 | 2.00 | 1.00 |
| NumWebPurchases | 4.69 | 5.99 | 2.16 | 5.43 | 4.00 | 6.00 | 2.00 | 5.00 |
| NumCatalogPurchases | 3.62 | 4.25 | 0.58 | 6.00 | 2.00 | 4.00 | 0.00 | 6.00 |
| NumStorePurchases | 5.08 | 8.13 | 3.31 | 8.08 | 6.00 | 8.00 | 3.00 | 8.00 |
| NumWebVisitsMonth | 8.15 | 4.62 | 6.34 | 3.14 | 7.00 | 5.00 | 7.00 | 3.00 |
| TotNumPurchased | 13.38 | 18.37 | 6.06 | 19.50 | 15.00 | 18.00 | 6.00 | 19.00 |
| TotAccepted | 1.92 | 0.29 | 0.16 | 2.33 | 1.00 | 0.00 | 0.00 | 2.00 |
| AvgPerPurchase | 65.32 | 50.90 | 14.81 | 87.68 | 70.74 | 42.33 | 12.67 | 82.06 |
| TotSpent | 1143.77 | 931.00 | 103.24 | 1633.99 | 1662.00 | 907.00 | 66.00 | 1654.00 |
Of course, averages will tell us little about Group -1: these are noise points scattered widely, so their per-dimension averages still settle toward the center without describing any real subgroup.
for col in data_behavior.columns:
    sns.boxplot(x = 'DBSLabels', y = col, data = data)
    plt.show()
for col in data_demo.columns:
    sns.boxplot(x = 'DBSLabels', y = col, data = data_demo)
    plt.show()
mean = round(data_demo.groupby('DBSLabels').mean(), 2)
print(mean)
TotDependents Age FamilySize Income Marital_Status_Married \
DBSLabels
-1 0.46 43.54 2.15 52268.58 0.23
0 0.80 49.56 2.65 63320.31 0.39
1 1.24 45.11 3.11 35586.95 0.39
2 0.21 46.76 2.07 80355.84 0.39
Marital_Status_Single Marital_Status_Together \
DBSLabels
-1 0.23 0.23
0 0.20 0.26
1 0.22 0.26
2 0.24 0.22
Marital_Status_Widow Education_Basic Education_Graduation \
DBSLabels
-1 0.00 0.00 0.54
0 0.04 0.00 0.51
1 0.03 0.05 0.49
2 0.06 0.00 0.52
Education_Master Education_PhD HCLabels
DBSLabels
-1 0.00 0.38 0.77
0 0.16 0.25 1.00
1 0.17 0.19 3.00
2 0.17 0.24 0.00
Summary of each cluster: These clusters aren't as illustrative. Once again, the highest-income group (Group 2, only 228 customers) spends the most and has the smallest families. Group 2 responds to campaigns, but not to special deals.
Group 1 is the largest, with 1077 customers; they have the lowest incomes, bigger families, and buy far less. They don't respond to campaigns or, especially, to deals.
Group 0 has 922 customers. They likewise don't respond to campaigns or to deals. They are middle-income and spend a correspondingly middling amount.
The mysteriously scattered Group -1 is the most educated (PhD) and the youngest. The only thing its members seem to have in common is that they respond to both deals and campaigns. There are only 13 of them.
Three clusters has so far seemed the best choice, but let's do one last round of tuning based on silhouette score to confirm this.
sc = {}
# Iterate over a range of k values, fit the PCA-transformed data, and store the silhouette score for each k
for k in np.arange(2, 8):
    gmm = GaussianMixture(n_components = k, random_state = 1)
    gmm.fit(transformed_data)
    labels = gmm.predict(transformed_data)
    # Silhouette score is only defined when more than one cluster is present
    if len(np.unique(labels)) > 1:
        sc[k] = silhouette_score(transformed_data, labels)
# Silhouette score vs. k
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.show()
We confirm that 3 clusters is the best choice for GMM as well (silhouette ≈ 0.56 again).
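Silhouette is one selection criterion; since a GMM is a likelihood model, the BIC offers an independent check on the number of components. A minimal sketch on synthetic stand-in data (the notebook would pass `transformed_data` in place of `X`):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for the PCA scores: three synthetic, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in (-4, 0, 4)])

# Lower BIC is better; scan the same range of component counts as above
bics = {k: GaussianMixture(n_components=k, random_state=1).fit(X).bic(X)
        for k in range(2, 8)}
best_k = min(bics, key=bics.get)
print(best_k)
```

If the BIC minimum and the silhouette maximum agree on k, that strengthens the case for the chosen cluster count.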
gmm = GaussianMixture(n_components = 3, random_state = 1)
gmm.fit(transformed_data)
# Predict once and propagate the labels to each working DataFrame
gmm_labels = gmm.predict(transformed_data)
data_behavior_scaled_copy['GmmLabels'] = gmm_labels
data_behavior['GmmLabels'] = gmm_labels
data['GmmLabels'] = gmm_labels
data_demo['GmmLabels'] = gmm_labels
transformed_data['GmmLabels'] = gmm_labels
data.GmmLabels.value_counts()
GmmLabels
1    1080
0     925
2     235
Name: count, dtype: int64
Observations and Insights:
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC1', data=transformed_data, hue='GmmLabels', palette='Dark2')
plt.show()
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data, hue='KMeans_Labels', palette='Dark2')
plt.subplot(1, 2, 2)
sns.scatterplot(y='PC2', x='PC3', data=transformed_data, hue='GmmLabels', palette='Dark2')
plt.show()
Cluster 0 shrank a bit (n=925 here vs. n=1217 in KMeans, n=579 in KMedoids, n=927 in HC, and n=922 in DBSCAN), but otherwise the clusterings look very similar.
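One quick way to quantify how closely two labelings agree is a crosstab of the label columns; in the notebook this would be `pd.crosstab(data['KMeans_Labels'], data['GmmLabels'])`. A sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical labels standing in for data['KMeans_Labels'] / data['GmmLabels']
df = pd.DataFrame({
    "KMeans_Labels": [0, 0, 1, 1, 2, 2, 0, 1],
    "GmmLabels":     [0, 0, 1, 1, 2, 2, 1, 1],
})

# Rows: KMeans cluster; columns: GMM cluster. A near-diagonal table
# (after matching up label ids) means the two partitions mostly agree.
ct = pd.crosstab(df["KMeans_Labels"], df["GmmLabels"])
print(ct)
```

Off-diagonal counts locate exactly which customers moved between clusters across methods.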
mean = round(data_demo.groupby('GmmLabels').mean(), 2)
print(mean)
mean = round(data_behavior.groupby('GmmLabels').mean(), 2)
print(mean)
TotDependents Age FamilySize Income Marital_Status_Married \
GmmLabels
0 0.80 49.54 2.65 63292.55 0.39
1 1.24 45.10 3.11 35500.86 0.39
2 0.21 46.69 2.07 80096.02 0.39
Marital_Status_Single Marital_Status_Together \
GmmLabels
0 0.20 0.26
1 0.22 0.26
2 0.24 0.23
Marital_Status_Widow Education_Basic Education_Graduation \
GmmLabels
0 0.04 0.00 0.51
1 0.03 0.05 0.49
2 0.06 0.00 0.52
Education_Master Education_PhD HCLabels DBSLabels
GmmLabels
0 0.16 0.25 1.00 -0.00
1 0.17 0.19 3.00 0.99
2 0.16 0.24 0.01 1.91
MntWines MntFruits MntMeatProducts MntFishProducts \
GmmLabels
0 467.90 41.91 250.21 59.93
1 46.58 4.97 23.87 7.24
2 841.26 62.93 496.81 88.50
MntSweetProducts MntGoldProds AcceptedCmp4 AcceptedCmp5 \
GmmLabels
0 42.28 69.64 0.08 0.01
1 5.12 15.36 0.01 0.00
2 68.01 74.90 0.36 0.66
AcceptedCmp1 Response NumDealsPurchases NumWebPurchases \
GmmLabels
0 0.03 0.11 2.92 5.98
1 0.00 0.08 2.07 2.16
2 0.50 0.60 1.17 5.47
NumCatalogPurchases NumStorePurchases NumWebVisitsMonth \
GmmLabels
0 4.25 8.12 4.62
1 0.58 3.30 6.37
2 5.98 8.06 3.20
TotNumPurchased TotAccepted AvgPerPurchase TotSpent
GmmLabels
0 18.35 0.29 50.88 931.88
1 6.04 0.16 14.93 103.13
2 19.51 2.35 87.33 1632.41
for col in data_behavior.columns:
    sns.boxplot(x = 'GmmLabels', y = col, data = data)
    plt.show()
Summary of each cluster:
Observations and Insights: Very similar stories to what we've seen before.
Group 2 (n=235) is almost exactly the same group as DBSCAN's Group 2: highest income, smallest families, highest spenders, more store and catalog purchases, fewer website visits. Not susceptible to deals, but readily responsive to marketing campaigns.
Group 1 (n=1080) is again the web window-shoppers: lowest income, least educated, largest families, smallest spenders, who spend a lot of time on the website without buying very much. They are a bit more likely to respond to deals, but not to campaigns. This is also Group 1 from DBSCAN.
Group 0 (n=925): intermediate income, family size, and spending; highly educated. They don't respond to marketing campaigns but are the most responsive to deals.
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
2. Refined insights:
plt.figure(figsize=(15, 7))
plt.subplot(1, 2, 1)
sns.scatterplot(data, x='Income', y='TotAccepted', hue='NumDealsPurchases')
plt.subplot(1, 2, 2)
sns.scatterplot(data, y='NumDealsPurchases', x='Income', hue='TotAccepted')
plt.tight_layout()
plt.show()
The scatterplots above demonstrate that it is the higher-income people who accept campaigns and the lower-income people who respond to deals.
The left plot shows an absence of lower-income people who accept campaigns, while higher-income people take fewer deals (indicated by a light color). The right plot shows a sharp drop-off in the number of deals taken by higher-income customers, while the darker dots at higher incomes show increasing receptivity to marketing campaigns. Notably, middle-income people take the most deals: they are price-conscious while still having enough money to buy a lot.
However, these trends are far from perfect or categorical, so simply segmenting the customer base by income would not be effective, underscoring the need for the clustering algorithms employed here.
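As a sanity check of that claim, one could bin income into quartiles and compare average campaign acceptance and deal-taking per bin. A sketch on synthetic data constructed to mirror the described trend (the notebook's real columns are `Income`, `TotAccepted`, and `NumDealsPurchases`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: campaign acceptance rises with income,
# deal-taking peaks in the middle of the income range
rng = np.random.default_rng(0)
n = 1000
income = rng.uniform(10_000, 100_000, n)
accepted = (income / 100_000 > rng.uniform(0, 1, n)).astype(int)
deals = rng.poisson(3 * np.exp(-((income - 55_000) / 25_000) ** 2))

df = pd.DataFrame({"Income": income, "TotAccepted": accepted,
                   "NumDealsPurchases": deals})
# Quartile bins make the income gradient explicit
df["IncomeQ"] = pd.qcut(df["Income"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
summary = df.groupby("IncomeQ", observed=True)[["TotAccepted", "NumDealsPurchases"]].mean()
print(summary)
```

Run against the real columns, the same groupby would show whether the income quartiles alone reproduce the campaign/deal split, or whether (as argued above) the overlap between bins makes clustering necessary.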
3. Proposal for the final solution design:
-- t-SNE is helpful in that it cleanly removes 1708 people who do not respond to campaigns. The problem is that it does not produce any group with a strong tendency to take deals.
-- KMeans chooses a group of 803 people who respond to campaigns, but similarly does not identify a group with a strong tendency to take deals.
-- KMedoids does not create a cluster of campaignable individuals. Since these would be the highest-income individuals who spend the most, any final model must include a group that responds to campaigns.
-- DBSCAN locates 228 campaignable customers and 1999 who respond to deals, but we are discouraged by the hyper-focused group of 13 people (Group -1). They appear receptive to both campaigns and deals, but the fact that they fail to show any kind of localization in the corresponding PCA scatterplots makes us doubt the model's creation of this cluster; we speculate that it is an instance of overfitting.
-- Gaussian Mixture similarly gives 235 campaignable customers and 2005 who respond to deals. This is similar to Hierarchical Clustering (HC), except that HC's clusters also manage to find deal-takers with high incomes, who correspondingly spend more money.
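To put the candidate models on a common footing, one could score each stored labeling with the silhouette on the same feature matrix, excluding DBSCAN's noise points (label -1) first. A sketch on synthetic data; in the notebook the labelings would come from columns such as `KMeans_Labels`, `DBSLabels`, and `GmmLabels`:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# Stand-in for the PCA scores: three synthetic, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(150, 2)) for c in (-4, 0, 4)])

labelings = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5).fit_predict(X),
}

scores = {}
for name, labels in labelings.items():
    mask = labels != -1                      # drop DBSCAN's noise points before scoring
    if len(np.unique(labels[mask])) > 1:     # silhouette needs >= 2 clusters
        scores[name] = silhouette_score(X[mask], labels[mask])
        print(name, round(scores[name], 3))
```

Note the caveat: dropping the noise points flatters DBSCAN's score, so the comparison is indicative rather than strictly fair.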
plt.figure(figsize=(15, 7))
plt.subplot(1, 2, 1)
sns.boxplot(data, x='TotDependents', y='TotAccepted')
plt.subplot(1, 2, 2)
sns.boxplot(data, y='NumDealsPurchases', x='TotDependents')
plt.tight_layout()
plt.show()
!pip install nbconvert
Requirement already satisfied: nbconvert in /usr/local/lib/python3.10/dist-packages (6.5.4)
import nbconvert
import os
from nbconvert import HTMLExporter
!jupyter nbconvert --to html /content/Capstone_Project_Reference_Notebook_Full_Code_Marketing_Campaign_Analysis.ipynb
[NbConvertApp] WARNING | pattern '/content/Capstone_Project_Reference_Notebook_Full_Code_Marketing_Campaign_Analysis.ipynb' matched no files